Generalized linear models exist to map different distributions into linear space.
Logistic regression is for binary outcome variables (e.g., closed/open, broke/working, dead/alive) and we are predicting the probability of the outcome.
Our coefficients might take on different meanings when we use a different distribution.
Predicted probabilities are likely the easiest way to interpret some models.
GLM is telling you exactly what it is within the title.
But, how is it generalized?
Remember how linear models really enjoy the whole Gaussian distribution scene?
The essential form of the linear model can be expressed as follows:
\[y \sim Normal(\mu,\sigma) \\ \mu = \alpha + \beta X\]
Not all data follows a Gaussian distribution. Instead, we often find some other form of an exponential distribution.
So, we need a way to incorporate different distributions of the DV into our model.
Distributions cannot do it alone!
From a theoretical perspective, link functions are tricky to get your head around.
From a conceptual perspective, all they are doing is allowing the linear predictor to “link” to a distribution function’s mean.
At the end of the day, these link functions will convert the outcome (DV) to an unbounded continuous variable.
The take-away here is that the link function describes how the mean is generated from the predictors.
Linear regression is not new to us.
A linear regression deals with real numbers between \(-\infty\) to \(\infty\) in the dependent variable.
It is the most vanilla within the GLM.
How do you find a mean from a normal distribution?
crimeLink <- "http://nd.edu/~sberry5/data/crimeScore.csv"
crimeScore <- data.table::fread(crimeLink)
lmTest <- glm(SSL_SCORE ~ WEAPONS_ARR_CNT,
data = crimeScore,
family = gaussian)
summary(lmTest)
Call:
glm(formula = SSL_SCORE ~ WEAPONS_ARR_CNT, family = gaussian,
data = crimeScore)
Deviance Residuals:
Min 1Q Median 3Q Max
-263.971 -29.971 0.029 30.029 183.029
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 300.1126 1.0112 296.79 <2e-16 ***
WEAPONS_ARR_CNT 16.8585 0.7757 21.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 2814.148)
Null deviance: 55206149 on 19146 degrees of freedom
Residual deviance: 53876862 on 19145 degrees of freedom
(379537 observations deleted due to missingness)
AIC: 206414
Number of Fisher Scoring iterations: 2
We see a lot of the same information, but we have a new model fit statistic to look at: deviance.
Deviance is badness. The null deviance is telling us how well our dependent variable is predicted by a model with only the intercept – this is generally the case with all null models.
Our model, captured in the residual deviance, showed a marked decrease in deviance with very little lost in degrees of freedom.
We can test it with the following:
pchisq(lmTest$null.deviance - lmTest$deviance,
df = lmTest$df.null - lmTest$df.residual,
lower.tail = FALSE)
[1] 0
We are testing a hypothesis that the difference between our null and model deviance is \(\chi^2\) distributed. By specifying that our lower tail is false, we are specifically testing a proportion of the distribution that is higher than the value that we found. This significant p-value tells us that our model is certainly better than a model without predictors!
As a final point about our output, the value of AIC is not really interpretable – it is used only for model comparison.
We can also produce our confidence intervals:
confint(lmTest)
2.5 % 97.5 %
(Intercept) 298.13065 302.09447
WEAPONS_ARR_CNT 15.33824 18.37886
Logistic regression is substantially different than linear regression.
Instead of that nice continuous dv, we are dealing with a binomially-distributed DV.
Our response takes the form of a binary variable.
We don’t have a \(\mu\) or \(\sigma^2\) to identify the shape of the distribution; instead we have p and n, where p is a probability and n is the number of trials.
We tend to talk about p with regard to the probability of a specific event happening (heads, wins, defaulting, etc.).
Let’s see how the binomial distribution looks with 100 trials and probabilities of .25, .5, and .75:
library(ggplot2)
library(dplyr)
binom.25 <- dbinom(1:100, size = 100, prob = .25)
binom.5 <- dbinom(1:100, size = 100, prob = .5)
binom.75 <- dbinom(1:100, size = 100 , prob = .75)
as.data.frame(rbind(binom.25, binom.5, binom.75)) %>%
mutate(prob = row.names(.)) %>%
tidyr::gather(., "key", "value", -prob) %>%
mutate(key = as.numeric(gsub("V", "", key))) %>%
ggplot(., aes(x = key, y = value, color = prob)) +
geom_line() +
scale_color_brewer(type = "qual", palette = "Dark2") +
theme_minimal()
If we examine the line for a probability of .5 (the orange line titled “binom.5”), we will see that it is centered over 50 – this would suggest that we have the highest probability of encountering 50 successes. If we run 20 trials and the outcome is 50/50, we would expect to see success in half the trials and a decreasing number of trials for more or less successes. Shifting our attention to a .75 probability of success, we see that our density is sitting over 75.
Since we are dealing with a number of trials, it is worth noting that the binomial distribution is a discreet distribution. If you have any interest in knowing the probability for a number of success under the binomial distribution, we can use the following formula:
\[P(x) = \frac{n!}{(n-x)!x!}p^xq^{n-x}\]
While we don’t need to dive too deep into finding those specific values for the binomial distribution, we can spend our time exploring how it looks in linear model space:
\[y \sim Binomial(n, p) \\ logit(p) = \alpha + \beta X\]
The logit function is defined as:
\[log\frac{p}{1-p}\]
We are literally just taking the log of the odds (the log odds becomes important later).
Now we can map this back to our model:
\[log\frac{p}{1-p} = \alpha + \beta X\]
And finally we can take that logistic function and invert it (the inverse-logit) to produce the probabilities.
\[p = \frac{exp(\alpha + \beta X)}{1 + exp(\alpha + \beta X)}\]
We will see more about how this happens after playing with the following model:
library(data.table)
kickstarter <- fread("https://www.nd.edu/~sberry5/data/kickstarter.csv")
kickstarter <- kickstarter[state == "successful" | state == "failed",
][, state := ifelse(state == "successful", 1, 0)]
# You can set the variable types individually:
# kickstarter$backers <- as.numeric(kickstarter$backers)
# kickstarter$goal <- as.numeric(kickstarter$goal)
# But the following is more efficient:
numericVars <- c("backers", "goal")
kickstarter[, (numericVars) := lapply(.SD, function(x) as.numeric(x)),
.SDcols = numericVars]
Let’s start with a visual breakdown of failure and success:
ggplot(kickstarter, aes(state)) +
geom_bar() +
theme_minimal()
And a table:
addmargins(table(kickstarter$state))
0 1 Sum
168221 113081 281302
If we take the number of successes and divide by the total, we will get the probability of being funded:
113081 / 281302
[1] 0.4019915
The probability of being funded is 0.4019915
While just knowing the probability is great, it does not provide much of a model. So, we can add predictor variables to model their relationship with the outcome variable.
logTest <- glm(state ~ backers, data = kickstarter,
family = binomial)
summary(logTest)
Call:
glm(formula = state ~ backers, family = binomial, data = kickstarter)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.4904 -0.6800 -0.6408 0.7450 1.8353
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4788736 0.0060867 -243 <2e-16 ***
backers 0.0264517 0.0001336 198 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 379089 on 281301 degrees of freedom
Residual deviance: 272617 on 281300 degrees of freedom
AIC: 272621
Number of Fisher Scoring iterations: 11
We are now dealing with log odds in the coefficients. We would say that for every unit increase in backers, the log odds of a campaign being funded is ~.03.
Let’s also visualize this and come back to it later!
ggplot(kickstarter) +
geom_count(aes(backers, state)) +
theme_minimal()
As its default, R produces the log odds for the regression coefficient in a logistic model.
Probability lies at the heart of all of this. We can look at the relationship between the probability, odds, and log odds.
probabilityList <- c(.001, .01, .15, .2,
.25, .3, .35, .4, .45,
.5, .55, .6, .65, .7,
.75, .8, .85, .9)
We have a list of probability values (always between 0 and 1). Now, let’s write a function to convert them to odds. We will use the \(\\p\, / 1 - p\) equation.
oddsConversion <- function(p) {
res = p / (1 - p)
return(res)
}
odds <- oddsConversion(probabilityList)
plot(probabilityList, odds)
Now, we can convert them to log odds:
plot(odds, log(odds))
We can clearly go back and forth between the 3, but the main message here is that we took a bounded variable in probability and transformed it to continuous space.
The intercept is offering the log odds of a campaign with 0 backers being funded – converting this back to probability (just adding the exponentiation into solving back to probability), we get the following:
exp(coef(logTest)["(Intercept)"]) / (1 + exp(coef(logTest)["(Intercept)"]))
(Intercept)
0.1855976
We are dealing with a pretty small probability (but certainly not 0) that a campaign with 0 backers could be successful.
Recall that the coefficient in log odds for the backers was .02645174.
Let’s start at the median value of backers:
median(kickstarter$backers)
[1] 15
Now, let’s solve our equation for that median value. This will produce the conditional logit.
medLogit <- logTest$coefficients["(Intercept)"] +
(logTest$coefficients["backers"] * median(kickstarter$backers))
names(medLogit) <- NULL
medLogit
[1] -1.082097
Let’s do the same thing for the next sequential value of backers:
medLogitPlus1 <- logTest$coefficients["(Intercept)"] +
(logTest$coefficients["backers"] * (median(kickstarter$backers) + 1))
names(medLogitPlus1) = NULL
medLogitPlus1
[1] -1.055646
Now we have two sequential conditional logits that we can subtract from each other:
backersCoef <- medLogitPlus1 - medLogit
backersCoef
[1] 0.02645174
This is exactly the coefficient that we got from the model. If we exponentiate the log odds, we are “unlogging” it to get the odds ratio that we saw earlier (just plain old odds at this point). This is sometimes referred to as the proportional change in odds.
exp(backersCoef)
[1] 1.026805
For every unit increase in backer, we have a 2% increase in the odds that the campaign will be successful. These odds are mostly stacking as we increase (see the plot below). When we look at the odds, anything above 1 is in favor of moving from the “0” category to the “1” category, whereas anything below 1 is in favor of not moving from the “0”. From a hypothesis perspective, this makes absolute sense – we would expect that having more backers contributes to a better chance of funding.
There are interesting issues at play here with regard to our predictor coefficients (what can be considered a relative effect) and the model’s effect as a whole on the probability (the absolute effect). In circumstances where the intercept is very large (essentially promising a success), the relative effect of a coefficient is practically meaningless. Similarly, very negative coefficients render the relative effects useless.
When we construct our linear model, we are trying to predict the response value – not the case with logistic regression. Instead we are trying to predict the probability of going from 0 to 1 (not funded to funded). So, normal plots that we might usually create with linear models will do us no good here.
First, we need to make our predicitions and put them back into the data:
kickstarter$predictedProbs <- predict(logTest,
type = "response")
The predict function is how we get the predictions from any model. Do note the type argument with the “response” specification; this ensures that we are using the type of response particular to the model.
Now, we can plot it:
ggplot(kickstarter[backers < 500], aes(backers, predictedProbs)) +
geom_line(size = 1.5) +
theme_minimal()
For our data, once you start to get above 250 backers, there is a probablity of 1 that it will be successful.
If we wanted to apply new data to our model, we would also use the predict() function, but give it new data.
Let’s add one more continuous predictor into the model.
twoPredictors <- glm(state ~ backers + goal, data = kickstarter,
family = binomial)
summary(twoPredictors)
Call:
glm(formula = state ~ backers + goal, family = binomial, data = kickstarter)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.4904 -0.6313 -0.0472 0.3570 8.4904
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.960e-01 7.037e-03 -127.3 <2e-16 ***
backers 5.106e-02 2.344e-04 217.9 <2e-16 ***
goal -2.053e-04 1.201e-06 -170.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 379089 on 281301 degrees of freedom
Residual deviance: 191135 on 281299 degrees of freedom
AIC: 191141
Number of Fisher Scoring iterations: 25
exp(twoPredictors$coefficients)
(Intercept) backers goal
0.4081851 1.0523893 0.9997948
Lets run that same model, but with some tweaks, and then do something cool!
library(plotly)
truncatedData <- kickstarter[backers >= median(backers) &
backers <= quantile(backers, .75) &
goal >= median(goal) &
goal <= quantile(goal, .75)]
summary(truncatedData)
ID name category
Min. :2.237e+05 Length:18833 Length:18833
1st Qu.:5.366e+08 Class :character Class :character
Median :1.074e+09 Mode :character Mode :character
Mean :1.075e+09
3rd Qu.:1.611e+09
Max. :2.147e+09
main_category currency deadline
Length:18833 Length:18833 Length:18833
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
goal launched pledged
Min. : 5000 Length:18833 Length:18833
1st Qu.: 5000 Class :character Class :character
Median : 7500 Mode :character Mode :character
Mean : 8374
3rd Qu.:10000
Max. :15000
state backers country
Min. :0.0000 Min. :15.00 Length:18833
1st Qu.:0.0000 1st Qu.:22.00 Class :character
Median :0.0000 Median :33.00 Mode :character
Mean :0.3444 Mean :34.68
3rd Qu.:1.0000 3rd Qu.:47.00
Max. :1.0000 Max. :62.00
usd pledged V14 V15
Length:18833 Length:18833 Length:18833
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
V16 V17 predictedProbs
Length:18833 Min. : NA Min. :0.2531
Class :character 1st Qu.: NA 1st Qu.:0.2897
Mode :character Median : NA Median :0.3530
Mean :NaN Mean :0.3672
3rd Qu.: NA 3rd Qu.:0.4414
Max. : NA Max. :0.5402
NA's :18833
twoPredictors <- glm(state ~ backers + goal,
data = truncatedData,
family = binomial)
backers <- unique(truncatedData$backers)
goals <- unique(truncatedData$goal)
grid <- expand.grid(backers,
goals)
newData <- setNames(data.frame(grid), c("backers", "goal"))
predictedVals <- predict(twoPredictors,
newdata = newData,
type = "response")
zMatrix <- matrix(predictedVals,
nrow = length(unique(newData$backers)),
ncol = length(unique(newData$goal)))
plot_ly() %>%
add_surface(x = ~goals, y = ~backers, z = ~zMatrix)
Let’s pick a few categories to work on.
catData <- kickstarter[main_category == "Film & Video" |
main_category == "Music" |
main_category == "Games",
.(state, main_category)]
Let’s look a the crosstabs for those variables:
addmargins(table(catData))
main_category
state Film & Video Games Music Sum
0 29653 13013 19193 61859
1 21404 9385 21763 52552
Sum 51057 22398 40956 114411
We can start to look at the odds for each category being successful:
filmOdds <- (21404 / 51057) / (29653 / 51057) # You can reduce this to 21404 / 29653
gamesOdds <- (9385 / 22398) / (13013 / 22398)
musicOdds <- (21763 / 40956) / (19193 / 40956)
filmOdds
[1] 0.7218157
gamesOdds
[1] 0.7212019
musicOdds
[1] 1.133903
And we can take each one back to a probability:
filmOdds / (1 + filmOdds)
[1] 0.4192177
gamesOdds / (1 + gamesOdds)
[1] 0.4190106
musicOdds / (1 + musicOdds)
[1] 0.5313751
We can take our probability formulation and divide 1 - the probability to recover our odds:
filmOdds / (1 + filmOdds) / (1 - (filmOdds / (1 + filmOdds)))
[1] 0.7218157
musicOdds / (1 + musicOdds) / (1 - (musicOdds / (1 + musicOdds)))
[1] 1.133903
When we consider what our odds mean now, we can envision them as the probability of the outcome occurring divided by the probability that it does not happen.
We could also compare those two directly:
(21763 / 19193) / (21404 / 29653)
[1] 1.570904
So the odds that a music campaign will be funded are about 57% higher than the odds for a film.
And we could also compare film to games:
(9385 / 13013) / (21404 / 29653)
[1] 0.9991497
Here, the odds that a game will be funded are less than 1% lower than the odds for a film.
Let’s run our logistic regression now:
categoryLogTest <- glm(state ~ main_category,
data = catData,
family = binomial)
summary(categoryLogTest)
Call:
glm(formula = state ~ main_category, family = binomial, data = catData)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.231 -1.042 -1.042 1.319 1.319
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3259855 0.0089690 -36.346 <2e-16 ***
main_categoryGames -0.0008507 0.0162432 -0.052 0.958
main_categoryMusic 0.4516511 0.0133602 33.806 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 157849 on 114410 degrees of freedom
Residual deviance: 156517 on 114408 degrees of freedom
AIC: 156523
Number of Fisher Scoring iterations: 4
exp(coef(categoryLogTest))
(Intercept) main_categoryGames main_categoryMusic
0.7218157 0.9991497 1.5709038
filmOdds
[1] 0.7218157
The film group is about 28% less likely to be funded than everything else.
Bivariate relationships are really important for logistic models.
Your model needs to see some dispersion of values over the bivariate tables.
Logistic regression requires a larger sample size than what a linear regression needs.
\(R^2\) does not apply to a logistic regression.
There are many pseudo-\(R^2\), but they really do not mean the same thing as in linear regression.
You might be asked for them and many people present them.
Here is a great breakdown of different pseudo methods.
Remember our original model had some issues with separation. Now is a good time to try to fix that issue. An easy place to start might be to tidy up the range of backers.
Determine some cut-off values for backers and build a new model. See if you can take care of the separation issue and if you can improve upon the coefficient.
Poisson regression is for count-based dependent variables (e.g., how many touchdowns does a team get, how many safety violations does a factory have, how many credit cards do you have under your name)
If you have a lot of zeroes in your data, you might want to consider a zero-inflated Poisson.
If your count variable does not follow a Poisson distribution, then you might want to use a negative binomial model.
The Poisson distribution is very similar to the binomial distribution, but has some key differences. The biggest difference is in its parameter: Poisson has a single parameter noted as \(\lambda\). This rate parameter is going to estimate the expected number of events during a time interval. This can be accidents in a year, pieces produced in a day, or hits during the course of a baseball season. We can find the rate by determining the number of events per interval, multiplied by the interval length.
\[\frac{\text{event}}{\text{interval}}*\text{interval length} \]
To put some numbers to that, if we have 1 accident per week in a factory and we are observing a whole year, we would get the following rate:
(1 / 7) * 28
[1] 4
We expect about 4 accidents over the course of a month.
Let’s see what that particular distribution might look like:
ggplot(data.frame(x = 0:20), aes(x)) +
geom_col(aes(y = dpois(x, (1 / 7) * 28)), fill = "#ff5500") +
theme_minimal()
We can also see what it looks like for different rates (some places might be safer than others):
ggplot() +
geom_col(data = data.frame(x = 1:28, y = dpois(1:28, (1 / 7) * 28)),
mapping = aes(x, y), width = .97, alpha = .25, fill = "red") +
geom_col(data = data.frame(x = 1:28, y = dpois(1:28, (3 / 7) * 28)),
mapping = aes(x, y), width = .97, alpha = .25, fill = "blue") +
geom_col(data = data.frame(x = 1:28, y = dpois(1:28, (5 / 7) * 28)),
mapping = aes(x, y), width = .97, alpha = .25, fill = "green") +
theme_minimal()
This gets into an area where we need to think long and hard about our dependent variable and what it actually might be.
library(data.table)
library(vcd)
shroudData <- fread("https://www.nd.edu/~sberry5/data/shroudData.csv")
poissonTest <- goodfit(shroudData$shroudsProduced, type = "poisson")
summary(poissonTest)
Goodness-of-fit test for poisson distribution
X^2 df P(> X^2)
Likelihood Ratio 13.28809 14 0.5039751
This is a \(\chi^2\) to test if the distribution deviates from a Poisson. If we see a significant value, we would say that it deviates from the tested distribution. In this case, everything looks okay.
We can also plot that test using a hanging rootogram:
plot(poissonTest)
The bars are the observed counts and the red line/points are the fitted counts (i.e., how many would be expected). If a bar does not reach the 0 line, then the model would over-predict for that particular count; if the bar dips below the 0 line, the model under-predicts that count.
For models of this nature (our dependent variable is a count variable), we may have two different distributions with which to operate: the Poisson distribution or the negative binomial distribution.
Let’s check this out (it will be important later on!).
shroudData[, list(mean = mean(shroudsProduced),
var = var(shroudsProduced)),
by = employeeCount]
employeeCount mean var
1: 27 5.421875 4.882688
2: 17 5.144928 4.861040
3: 20 5.612903 4.765732
4: 22 5.052632 4.743860
5: 24 5.276923 4.484615
6: 30 8.250000 4.584507
7: 21 5.111111 4.494523
8: 19 5.298701 6.106972
9: 28 8.053571 4.524351
10: 16 5.354430 4.795846
11: 29 7.725806 4.267848
12: 15 4.906250 3.260913
13: 23 5.534247 5.891172
14: 25 4.802198 4.627106
15: 18 4.763636 4.739394
16: 26 5.293103 5.895039
What is the purpose of this? We are checking the conditional means and variances. Why is this important? If our variances are larger than our means, we have overdispersion. We would expect values to be equally distributed over levels, but if they are really spread out, this qualifies as overdispersion – this is not good for our Poisson model because it will cause downward bias.
As an overly simple check, we can also compare the mean and the variance of our outcome variable – they should be close to equal!
mean(shroudData$shroudsProduced)
[1] 5.684018
var(shroudData$shroudsProduced)
[1] 5.96405
Not terribly far off, but we will see if it becomes a problem later. You might be wondering why overdispersion is a problem – it goes back to the single parameter within the Poisson distribution. The normal distribution has a parameter for dealing with variance – \(\sigma\) – Poisson does not, so the assumption is that any variance would be equal to the mean.
Recall that every distribution has a link function (or several) that tend to work well for it. The poisson distribution uses a log link function:
\[y = Poisson(\lambda) \\ \text{log}(\lambda) = \alpha + \beta X\]
Using the log link keeps the outcome positive (we cannot deal with negative counts). Logs, as they are prone to do, are going to tend towards an exponential relationship; just be sure that it makes sense over the entire range of your data.
poissonTest = glm(shroudsProduced ~ employeeCount,
data = shroudData,
family = poisson)
summary(poissonTest)
Call:
glm(formula = shroudsProduced ~ employeeCount, family = poisson,
data = shroudData)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5346 -0.7647 -0.0033 0.5966 3.0804
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.084013 0.065469 16.56 <2e-16 ***
employeeCount 0.028771 0.002791 10.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1200.8 on 1094 degrees of freedom
Residual deviance: 1094.1 on 1093 degrees of freedom
AIC: 4926.3
Number of Fisher Scoring iterations: 4
exp(poissonTest$coefficients)
(Intercept) employeeCount
2.956519 1.029189
Important Note: We are going to interpret this almost the same as a linear regression. The slight wrinkle here, though, is that we are looking at the log counts (remember that we specified the log link function). In other words, an increase in one employee leads to an expected log count increase of ~.029. Just like our logisitc regression, we could exponentiate this to get 1.029189 – every employee we add gets us a ~3% increase in shrouds produced. Let’s see what this looks like in action:
shroudData = shroudData %>%
mutate(predValues = predict(poissonTest, type = "response"))
ggplot(shroudData, aes(employeeCount, predValues)) +
geom_count() +
scale_size_area() +
theme_minimal()
pchisq(poissonTest$deviance, poissonTest$df.residual, lower.tail = FALSE)
[1] 0.4848536
With everything coupled together, we have a meaningful coefficient, a clear plot, and adequate model fit. Therefore, we might conclude that there is a positive relationship between number of employees on shift and shrouds produced.
In addition to checking our data for over dispersion before running the model, we can also check it after running our model:
library(AER)
dispersiontest(poissonTest)
Overdispersion test
data: poissonTest
z = -1.3997, p-value = 0.9192
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
0.9452052
The dispersion value that we see returned (0.9452052 in our case) should be under 1. A dispersion value over 1 means that we have overdispersion. Our dispersion value, coupled with our high p-value, indicates that we would fail to reject the null hypothesis of equidispersion.
We can also look back to our model results to compare our residual deviance to our residual deviance degrees of freedom; if our deviance is greater than our degrees of freedom, we might have an issue with overdispersion. Since we are just a bit over and our overdispersion tests do not indicate any huge issue, we can be relatively okay with our model. If we had some more extreme overdispersion, we would want to flip to a quasi-poisson distribution – our coefficients would not change, but we would have improved standard errors.
Sometimes we have a seeming abundance of zero values within our data. We can have employees with zero absence periods, lines with zero quality failures, and days without safety issues. What is the process that generated the zeros? Are they coming from our count model (“true” zeroes) or something else (some random process)? This is where zero-inflated models become important. ZIP models are mixture models. We are not going to dive too deeply into this, but all you need to know is that a mixture model contains a “mixture” of different distributions:
\[y \sim \text{ZIP}(p,\lambda) \\ \text{logit} = \alpha_p + \beta_pX \\ \text{log} = \alpha_\lambda + \beta_\lambda X\]
This just formalizes the fact that our y can be generated from multiple processes.
redlights <- fread("https://www.nd.edu/~sberry5/data/redlights.csv")
poissonRedlight <- glm(citation_count ~ as.factor(camera_installed_ny),
data = redlights,
family = poisson)
summary(poissonRedlight)
Call:
glm(formula = citation_count ~ as.factor(camera_installed_ny),
family = poisson, data = redlights)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6361 -0.7131 -0.0428 0.6445 2.5624
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.29289 0.01455 157.59 <2e-16
as.factor(camera_installed_ny)1 -0.88529 0.02650 -33.41 <2e-16
(Intercept) ***
as.factor(camera_installed_ny)1 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2280.2 on 975 degrees of freedom
Residual deviance: 1059.8 on 974 degrees of freedom
AIC: 4581.8
Number of Fisher Scoring iterations: 4
With this output, we are comparing citation counts against intersections without a camera and those with a camera.
We see that our coefficient is -.88529 –- this means that having a camera leads to having .88529 less log counts than without having a camera.
We can also exponentiate that value to get an incident rate:
exp(poissonRedlight$coefficients)
(Intercept) as.factor(camera_installed_ny)1
9.9035639 0.4125961
Let’s see what those mean with an example:
tapply(X = redlights$citation_count,
INDEX = redlights$camera_installed_ny,
FUN = mean) # Old-school group_by and summarize!
0 1
9.903564 4.086172
4.086172/9.903564
[1] 0.4125961
Dividing the mean of the target group by the mean of the reference group will get us the incidence rate (i.e., for every one time someone runs a red light without a camera, it happens .41 times with camera).
If we take a look at citation_count’s distribution, we will see more than a few 0’s.
hist(redlights$citation_count)
For our redlight data, we saw that having a camera present had an effect on citations, but would it cause 0 citations? Or might there be something else contributing to the 0’s (e.g., no cars going through that intersection due to construction, no police nearby)? If there are no cars going through the intersection due to construction, is there even a chance of obtaining a non-zero response?
library(pscl)
zipTest <- zeroinfl(citation_count ~ as.factor(camera_installed_ny),
dist = "poisson", data = redlights)
# Important note: If you want your logistic model to differ from your
# poisson model, you would do this:
zipTest <- zeroinfl(citation_count ~
as.factor(camera_installed_ny) |
as.factor(camera_installed_ny),
dist = "poisson", data = redlights)
summary(zipTest)
Call:
zeroinfl(formula = citation_count ~ as.factor(camera_installed_ny) |
as.factor(camera_installed_ny), data = redlights, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-2.82923 -0.68433 -0.04265 0.66616 2.92554
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.29290 0.01455 157.59 <2e-16
as.factor(camera_installed_ny)1 -0.88528 0.02650 -33.41 <2e-16
(Intercept) ***
as.factor(camera_installed_ny)1 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -27.82 50417.30 -0.001 1
as.factor(camera_installed_ny)1 12.72 50417.48 0.000 1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 20
Log-likelihood: -2289 on 4 Df
Here we have two sets of coefficients: one for our count model (our actual Poisson regression) and one model attempting to account for excessive 0s. Our count model does not change, but we also see that our zero-inflation model will not account for the 0s within our data. This would be a clear indication that we have something else that might be contributing to the number of 0s that we see.
Remember that whole issue with our conditional means and variances? If we would have had problems those means and variances, we would need to abandon our Poisson distribution in favor of the negative binomial. The poisson distribution works when the sample mean and variance are equal – the negative binomial distribution frees that constraint and allows them to vary freely.
Remember this:
redlights[,
list(mean = mean(citation_count),
var = var(citation_count)),
by = camera_installed_ny]
camera_installed_ny mean var
1: 0 9.903564 11.083118
2: 1 4.086172 3.994567
Those look like the start of problems. Let’s check our whole sample now:
mean(redlights$citation_count)
[1] 6.929303
var(redlights$citation_count)
[1] 15.91602
There is clearly a problem here!
library(MASS)
nbTest <- glm.nb(citation_count ~ as.factor(camera_installed_ny),
data = redlights)
summary(nbTest)
Call:
glm.nb(formula = citation_count ~ as.factor(camera_installed_ny),
data = redlights, init.theta = 110.2437215, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5418 -0.6853 -0.0420 0.6157 2.4277
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.29289 0.01519 150.96 <2e-16
as.factor(camera_installed_ny)1 -0.88529 0.02719 -32.56 <2e-16
(Intercept) ***
as.factor(camera_installed_ny)1 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(110.2437) family taken to be 1)
Null deviance: 2148.6 on 975 degrees of freedom
Residual deviance: 998.6 on 974 degrees of freedom
AIC: 4581.6
Number of Fisher Scoring iterations: 1
Theta: 110.2
Std. Err.: 79.2
2 x log-likelihood: -4575.634
The interpretation of our negative binomial is exactly the same as our Poisson model – we have only relaxed the assumptions of our distribution! You might notice our model fits better (even if slightly so) by using the negative binomial.
The beta distribution has many practical uses. It can be used to model variables that are bound by 0 and 1 (but do not hit 0 or 1). If you have an outcome variable that is a proportion, it would be useful to keep the beta distribution in mind. We will also see the beta distribution in the context of Bayesian analysis, because we can use it as a prior for R^2.